KAMD: A Progress Estimator for MapReduce Pipelines
Authors
Abstract
Cluster computing environments such as MapReduce give users little feedback. Accurate, time-oriented progress indicators would be valuable in this domain, where job execution times can vary widely with the amount of data being processed, the amount of parallelism available, and the types of operators (often user-defined) that perform the processing. Such feedback would help users make informed decisions, such as whether to terminate a job and restart it later when the cluster has more resources available. However, none of the techniques used by existing tools or available in the literature provides a non-trivial progress indicator for queries running in a distributed environment. In this paper, we apply recently developed techniques for estimating the progress of single-site SQL queries to parallel environments. In particular, we target environments where queries consist of MapReduce job pipelines. We also present techniques that improve the accuracy and usefulness of progress estimators operating in this environment. We implemented our estimators in the Pig system and demonstrate their performance in experiments with real data (search logs) on a real cluster.
Related resources
Parallelizing XML Processing Pipelines via MapReduce
We present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that consume XML-structured data and produce, often through calls to “black-box” functions, modified (i.e., updated) XML structures. Our main contributions are a set of strategies for...
Parallelizing XML data-streaming workflows via MapReduce
In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the Map-Reduce framework. Pipelines in our approach consist...
Cloudflow - enabling faster biomedical pipelines with MapReduce and Spark
For many years Apache Hadoop has been used as a synonym for processing data in the MapReduce fashion. However, due to the complexity of developing MapReduce applications, adoption of this paradigm in genetics has been limited. To alleviate some of the issues, we have previously developed Cloudflow, a high-level pipeline framework that allows users to create sophisticated biomedical pipelines usi...
Halt or Continue: Estimating Progress of Queries in the Cloud
With cloud-based data management gaining ground by the day, the problem of estimating the progress of MapReduce queries in the cloud is of paramount importance. This problem is challenging to solve for two reasons: i) the cloud is typically a large-scale heterogeneous environment, which requires progress estimation to be tailored to non-uniform hardware characteristics, and ii) cloud is often built wit...
[Title unavailable; figure residue: Map task (Map, Combine), Reduce task, HDFS file split]
In parallel query-processing environments, accurate, time-oriented progress indicators could provide much utility to users given that queries take a very long time to complete and both inter- and intra-query execution times can have high variance. In these systems, query times depend on the query plans and the amount of data being processed, but also on the amount of parallelism available, the ty...
Publication date: 2009